library(tidyverse) # for data cleaning and plotting
library(lubridate) # for date manipulation
library(openintro) # for the abbr2state() function
library(palmerpenguins) # for Palmer penguin data
library(maps) # for map data
library(ggmap) # for mapping points on maps
library(gplots) # for col2hex() function
library(RColorBrewer) # for color palettes
library(sf) # for working with spatial data
library(leaflet) # for highly customizable mapping
library(carData) # for Minneapolis police stops data
library(ggthemes) # for more themes (including theme_map())
theme_set(theme_minimal())
# Starbucks locations
Starbucks <- read_csv("https://www.macalester.edu/~ajohns24/Data/Starbucks.csv")
starbucks_us_by_state <- Starbucks %>%
filter(Country == "US") %>%
count(`State/Province`) %>%
mutate(state_name = str_to_lower(abbr2state(`State/Province`)))
# Lisa's favorite St. Paul places - example for you to create your own data
favorite_stp_by_lisa <- tibble(
place = c("Home", "Macalester College", "Adams Spanish Immersion",
"Spirit Gymnastics", "Bama & Bapa", "Now Bikes",
"Dance Spectrum", "Pizza Luce", "Brunson's"),
long = c(-93.1405743, -93.1712321, -93.1451796,
-93.1650563, -93.1542883, -93.1696608,
-93.1393172, -93.1524256, -93.0753863),
lat = c(44.950576, 44.9378965, 44.9237914,
44.9654609, 44.9295072, 44.9436813,
44.9399922, 44.9468848, 44.9700727)
)
# COVID-19 data from the New York Times
covid19 <- read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
If you were not able to get set up on GitHub last week, go here and get set up first. Then, do the following (if you get stuck on a step, don’t worry, I will help! You can always get started on the homework and we can figure out the GitHub piece later):
Add `keep_md: TRUE` to the YAML header. The .md file is a markdown (NOT R Markdown) file that is an interim step in creating the html file. It is displayed fairly nicely on GitHub, so we want to keep it and look at it there. Click the boxes next to these two files, commit the changes (remember to include a commit message), and push them (green up arrow). Put your name at the top of the document.
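For reference, a minimal YAML header with this option might look like the following sketch (the title and output format are placeholders, not copied from the assignment):

```yaml
---
title: "Weekly Exercises"
output:
  html_document:
    keep_md: TRUE
---
```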
For ALL graphs, you should include appropriate labels.
Feel free to change the default theme, which I currently have set to theme_minimal().
Use good coding practice. Read the short sections on good code with pipes and ggplot2. This is part of your grade!
When you are finished with ALL the exercises, uncomment the options at the top so your document looks nicer. Don’t do it before then, or else you might miss some important warnings and messages.
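The commented-out options are presumably knitr chunk options; once uncommented they would look roughly like this (an assumption about the template, not copied from it):

```r
# Hide messages and warnings in the knitted output once all exercises are done.
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```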
These exercises will reiterate what you learned in the “Mapping data with R” tutorial. If you haven’t gone through the tutorial yet, you should do that first.
### Starbucks locations (ggmap)

Add the Starbucks locations to a world map. Add an aesthetic to the world map that sets the color of the points according to the ownership type. What, if anything, can you deduce from this visualization?

world_map <- get_stamenmap(
bbox = c(left = -180, bottom = -57, right = 179, top = 82.1),
maptype = "terrain",
zoom = 2)
ggmap(world_map) +
geom_point(data = Starbucks,
aes(x = Longitude,
y = Latitude,
color = `Ownership Type`),
size = .2) +
theme_map() +
theme(legend.background = element_blank()) +
labs(title = "Starbucks across the world")
There are mostly Licensed and Company Owned locations in North America. In Europe and Asia there are many Joint Venture locations, and western Europe has the only real cluster of Franchised locations on the map.
stpaul_map <- get_stamenmap(
bbox = c(left = -94.08, bottom = 44.54, right = -92.26, top = 45.43),
maptype = "terrain",
zoom = 9)
ggmap(stpaul_map) +
geom_point(data = Starbucks,
aes(x = Longitude,
y = Latitude,
color = `Ownership Type`),
size = 1,
alpha = .8) +
theme_map() +
theme(legend.background = element_blank()) +
labs(title = "Starbucks in the metro area")
stpaul_map <- get_stamenmap(
bbox = c(left = -94.08, bottom = 44.54, right = -92.26, top = 45.43),
maptype = "terrain",
zoom = 11)
ggmap(stpaul_map) +
geom_point(data = Starbucks,
aes(x = Longitude,
y = Latitude,
color = `Ownership Type`),
size = 1,
alpha = .8) +
theme_map() +
theme(legend.background = element_blank()) +
labs(title = "Messing with zooms")
Increasing the zoom increases the level of detail in the tiles, but the labels no longer scale well: all the place names are too small to read at this map size.
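To see what zoom is actually doing, you can compare the raster dimensions of the same bounding box at two zoom levels (a quick sketch; the bbox matches the maps above):

```r
# Each +1 to zoom roughly doubles the pixels in each dimension for the same area.
low_zoom <- get_stamenmap(
  bbox = c(left = -94.08, bottom = 44.54, right = -92.26, top = 45.43),
  maptype = "terrain", zoom = 9)
high_zoom <- get_stamenmap(
  bbox = c(left = -94.08, bottom = 44.54, right = -92.26, top = 45.43),
  maptype = "terrain", zoom = 11)
dim(low_zoom)   # fewer pixels
dim(high_zoom)  # about 4x the pixels in each dimension
```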
Try out some of the other map types (look up get_stamenmap() in Help and see the maptype argument). Include a map with one of the other map types.

stpaul_map <- get_stamenmap(
bbox = c(left = -94.08, bottom = 44.54, right = -92.26, top = 45.43),
maptype = "watercolor",
zoom = 9)
ggmap(stpaul_map) +
geom_point(data = Starbucks,
aes(x = Longitude,
y = Latitude,
color = `Ownership Type`),
size = 1,
alpha = .8) +
theme_map() +
theme(legend.background = element_blank()) +
labs(title = "Watercolor map of Starbucks locations")
Add a point to the map that indicates Macalester College and label it appropriately. There are many ways to do this, but the annotate() function may be easiest (see the ggplot2 cheatsheet).

stpaul_map <- get_stamenmap(
bbox = c(left = -94.08, bottom = 44.54, right = -92.26, top = 45.43),
maptype = "terrain",
zoom = 9)
ggmap(stpaul_map) +
geom_point(data = Starbucks,
aes(x = Longitude,
y = Latitude,
color = `Ownership Type`),
size = 1,
alpha = .8) +
theme_map() +
theme(legend.background = element_blank()) +
annotate(geom = "point", x = -93.17, y = 44.94, color = "green", label = "Macalester") +
labs(title = "Starbucks locations with Macalester marked")
##How do I get the label to work?
### Choropleth maps with Starbucks data (geom_map())

The example I showed in the tutorial did not account for the population of each state. In the code below, a new variable, starbucks_per_10000, is created that gives the number of Starbucks per 10,000 people. It is in the starbucks_with_2018_pop_est dataset.
census_pop_est_2018 <- read_csv("https://www.dropbox.com/s/6txwv3b4ng7pepe/us_census_2018_state_pop_est.csv?dl=1") %>%
separate(state, into = c("dot","state"), extra = "merge") %>%
select(-dot) %>%
mutate(state = str_to_lower(state))
starbucks_with_2018_pop_est <-
starbucks_us_by_state %>%
left_join(census_pop_est_2018,
by = c("state_name" = "state")) %>%
mutate(starbucks_per_10000 = (n/est_pop_2018)*10000)
dplyr review: Look through the code above and describe what each line of code does.

First chunk:
- Line 1: Reads in the spreadsheet and assigns it to census_pop_est_2018.
- Line 2: Splits the stray period off the front of each state name into its own column.
- Line 3: Drops the period column, keeping the state and population columns.
- Line 4: Makes the state names lowercase.

Second chunk:
- Line 1: Creates the new dataset.
- Line 2: Starts from the existing starbucks_us_by_state data.
- Line 3: Appends the population of each state to the data.
- Line 4: Creates a column giving the number of Starbucks for every 10,000 people.
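One quick sanity check on a join like this is to look for rows that failed to match (a sketch; an empty result means every state found a population estimate):

```r
# States whose names found no match in the census data.
starbucks_us_by_state %>%
  anti_join(census_pop_est_2018, by = c("state_name" = "state"))
```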
US_map <- map_data("state")
starbucks_with_2018_pop_est %>%
ggplot() +
geom_map(map = US_map,
aes(map_id = state_name,
fill = starbucks_per_10000)) +
scale_fill_viridis_c() +
expand_limits(x = US_map$long, y = US_map$lat) +
theme_map() +
theme(legend.background = element_blank()) +
labs(title = "Starbucks per 10,000 people",
fill = NULL)
With the notable exception of Washington, DC (that tiny yellow dot), the highest Starbucks densities are on the West Coast.

### A few of your favorite things (leaflet)
fav_places <- tibble(
place = c("Home", "Macalester College", "Church", "Thomas's House",
"Blair's House", "Lindwood Recreation Center", "Central High School", "Ramsey Middle School", "EXPO Elementary School", "Cadenza Music", "University of Minnesota East Bank", "Mississippi River Blvd/Jefferson Ave", "Ax-Man"),
long = c(-93.1632130682622, -93.1712321, -93.16503218194325, -93.12526680731808, -93.12127079579042, -93.13631499984636, -93.14826673723597,
-93.1725207001496, -93.16258646788002, -93.16736408635319,
-93.233861, -93.197707, -93.1695602261766),
lat = c(44.93063027435463, 44.9378965, 44.93793144940074, 44.9818313522325, 44.92995237419055, 44.93402239220738, 44.94929665918013,
44.94312748117745, 44.92536022429271, 44.94608502125374,
44.975941, 44.930711, 44.95613549519296),
fav = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
)
stpaul_map_zoom <- get_stamenmap(
bbox = c(left = floor(min(100*fav_places$long))/100, bottom = floor(100*min(fav_places$lat))/100, right = ceiling(max(100 * fav_places$long))/100, top = ceiling(max(100*fav_places$lat))/100),
maptype = "terrain",
zoom = 12)
pal <- colorFactor("plasma", domain = fav_places$fav)
leaflet(data = fav_places) %>%
addProviderTiles(providers$Esri.WorldImagery) %>%
addCircleMarkers(lng = ~long[c(1,3,9,5,4,6,13,8,10,7,11,12,2)],
lat = ~lat[c(1,3,9,5,4,6,13,8,10,7,11,12,2)],
label = ~place[c(1,3,9,5,4,6,13,8,10,7,11,12,2)],
color = ~pal(fav[c(1,3,9,5,4,6,13,8,10,7,11,12,2)]),
opacity = 1,
fill = FALSE) %>%
addPolylines(lng = ~long[c(1,3,9,5,4,6,13,8,10,7,11,12,2)],
lat = ~lat[c(1,3,9,5,4,6,13,8,10,7,11,12,2)]) %>%
addLegend(pal = pal,
values = ~fav,
opacity = 0.5,
title = "Top 3 locations in St. Paul",
position = "bottomright")
I had to index each vector to reorder the points so the connecting line visits the locations in a sensible order.
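A cleaner alternative is to store the visiting order in the data itself and arrange() once, instead of repeating the index vector in every argument (a sketch; the visit_order values are illustrative, not the exact route above):

```r
# Reorder the rows once; every ~long, ~lat, ~place below then lines up automatically.
fav_places_ordered <- fav_places %>%
  mutate(visit_order = c(1, 13, 2, 5, 4, 6, 10, 8, 3, 9, 11, 12, 7)) %>%
  arrange(visit_order)

leaflet(data = fav_places_ordered) %>%
  addProviderTiles(providers$Esri.WorldImagery) %>%
  addCircleMarkers(lng = ~long, lat = ~lat, label = ~place,
                   color = ~pal(fav), opacity = 1, fill = FALSE) %>%
  addPolylines(lng = ~long, lat = ~lat)
```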
Create a data set using the tibble() function that has 10-15 rows of your favorite places. The columns will be the name of the location, the latitude, the longitude, and a column that indicates if it is in your top 3 favorite locations or not. For an example of how to use tibble(), look at the favorite_stp_by_lisa I created in the data R code chunk at the beginning.
Create a leaflet map that uses circles to indicate your favorite places. Label them with the name of the place. Choose the base map you like best. Color your 3 favorite places differently than the ones that are not in your top 3 (HINT: colorFactor()). Add a legend that explains what the colors mean.
Connect all your locations together with a line in a meaningful way (you may need to order them differently in the original data).
If there are other variables you want to add that could enhance your plot, do that now.
This section will revisit some datasets we have used previously and bring in a mapping component.
The data come from Washington, DC and cover the last quarter of 2014.
Two data tables are available:
- Trips contains records of individual rentals
- Stations gives the locations of the bike rental stations

Here is the code to read in the data. We do this a little differently than usual, which is why it is included here rather than at the top of this file. To avoid repeatedly re-reading the files, start the data import chunk with {r cache = TRUE} rather than the usual {r}. This code reads in the large dataset right away.
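For reference, the cached import chunk would look like this (the chunk label is an arbitrary placeholder; data_site is defined just below):

```{r import-data, cache=TRUE}
Trips <- readRDS(gzcon(url(data_site)))
```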
data_site <-
"https://www.macalester.edu/~dshuman1/data/112/2014-Q4-Trips-History-Data.rds"
Trips <- readRDS(gzcon(url(data_site)))
Stations<-read_csv("http://www.macalester.edu/~dshuman1/data/112/DC-Stations.csv")
Use the latitude and longitude variables in Stations to make a visualization of the total number of departures from each station in the Trips data. Use either color or size to show the variation in number of departures. This time, plot the points on top of a map. Use any of the mapping tools you'd like.

trips_by_station <-
Stations %>%
select(lat, long, name) %>%
right_join(Trips, by = c("name" = "sstation")) %>%
group_by(long, lat, name) %>%
summarize(trips = n())
tripColor <- colorNumeric("plasma", domain = trips_by_station$trips)
leaflet(data = trips_by_station) %>%
addTiles() %>%
addCircles(lng = ~long,
lat = ~lat,
color = ~tripColor(trips),
opacity = 1,
label = ~name) %>%
addLegend(pal = tripColor,
values = ~trips,
opacity = 0.5,
title = "Trips taken",
position = "bottomright")
trips_by_station <-
Trips %>%
group_by(sstation) %>%
mutate(trips = n()) %>%
group_by(trips, sstation, client) %>%
summarize(casual_count = n()) %>%
filter(client == "Casual") %>%
summarize(casual_prop = casual_count/trips) %>%
left_join(Stations, by = c("sstation" = "name"))
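The grouped mutate/summarize chain above works, but the same proportions can be computed more directly (a sketch; mean() of a logical gives the proportion, and stations with zero casual trips are kept as 0 rather than dropped):

```r
trips_by_station_casual <- Trips %>%
  group_by(sstation) %>%
  summarize(trips = n(),
            casual_prop = mean(client == "Casual")) %>%
  left_join(Stations, by = c("sstation" = "name"))
```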
casual_propColor <- colorNumeric("plasma", domain = trips_by_station$casual_prop)
leaflet(data = trips_by_station) %>%
addTiles() %>%
addCircles(lng = ~long,
lat = ~lat,
color = ~casual_propColor(casual_prop),
opacity = 1,
label = ~casual_prop) %>%
addLegend(pal = casual_propColor,
values = ~casual_prop,
opacity = 0.5,
title = "Top 3?",
position = "bottomright")
I notice that casual riders seem to rent bikes near big parks, which is understandable: someone riding for fun on a weekend, or a tourist, is more likely to take a relaxing ride just for its own sake.
The following exercises will use the COVID-19 data from the NYT.
state_map <- map_data("state")
covid19 %>%
slice_max(date, n = 1) %>%
ggplot() +
geom_map(map = state_map,
aes(map_id = tolower(state),
fill = cases)) +
expand_limits(x = state_map$long, y = state_map$lat) +
theme_map() +
labs(title = "Total Cases as of March 2022",
fill = NULL)
We can see that the map mostly just highlights the more populated states, because we're measuring the raw number of people who got sick.
state_map <- map_data("state")
covid19_10000_recent <- covid19 %>%
slice_max(date, n = 1) %>%
mutate(state = tolower(state)) %>%
left_join(census_pop_est_2018, by = "state")%>%
mutate(cases10k = cases/est_pop_2018*10000)
covid19_10000_recent %>%
ggplot() +
geom_map(map = state_map,
aes(map_id = tolower(state),
fill = cases10k)) +
expand_limits(x = state_map$long, y = state_map$lat) +
theme_map() +
labs(title = "Total Cases per 10,000 people",
fill = NULL)
state_map <- map_data("state")
covid19_10000 <- covid19 %>%
filter((date =="2020-04-1"|date =="2020-08-1"|date =="2020-12-1"|date =="2021-04-1")) %>%
mutate(state = tolower(state)) %>%
left_join(census_pop_est_2018, by = "state")%>%
mutate(cases10k = cases/est_pop_2018*10000)
covid19_10000 %>%
ggplot() +
geom_map(map = state_map,
aes(map_id = tolower(state),
fill = cases10k)) +
facet_wrap(vars(date)) +
expand_limits(x = state_map$long, y = state_map$lat) +
theme_map() +
theme(legend.background = element_blank()) +
labs(title = "Total Coronavirus cases over time",
fill = NULL)
The relative trends stayed pretty similar over time: the same states kept doing comparatively better or worse. We also see a lot of growth in cases once winter hits.
These exercises use the datasets MplsStops and MplsDemo from the carData library. Search for them in Help to find out more information.
Use the MplsStops dataset to find out how many stops there were for each neighborhood and the proportion of stops that were for a suspicious vehicle or person. Sort the results from most to least number of stops. Save this as a dataset called mpls_suspicious and display the table.

mpls_suspicious <- MplsStops %>%
filter(problem == "suspicious") %>%
group_by(neighborhood) %>%
summarize(stops = n()) %>%
arrange(desc(stops))
mpls_suspicious
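Note that the chunk above counts only the suspicious stops. A sketch matching the full prompt (total stops per neighborhood plus the proportion that were suspicious; the names here are my own) would be:

```r
mpls_stop_summary <- MplsStops %>%
  group_by(neighborhood) %>%
  summarize(total_stops = n(),
            prop_suspicious = mean(problem == "suspicious")) %>%
  arrange(desc(total_stops))
```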
Use a leaflet map and the MplsStops dataset to display each of the stops on a map as a small point. Color the points differently depending on whether they were for a suspicious vehicle/person or a traffic stop (the problem variable). HINTS: use addCircleMarkers, set stroke = FALSE, use colorFactor() to create a palette.

susColor <- colorFactor("plasma", domain = MplsStops$problem)
leaflet(data = MplsStops) %>%
addTiles() %>%
addCircleMarkers(lng = ~long,
lat = ~lat,
radius = 2,
stroke = FALSE,
fillColor = ~susColor(problem),
fillOpacity = 0.7) %>%
addLegend(pal = susColor,
values = ~problem,
title = "Stop type")
Remove eval=FALSE from the chunk below once you have downloaded the data. Although it looks like it only links to the .shp file, you need the entire folder of files to create the mpls_nbhd dataset. These data contain information about the geometries of the Minneapolis neighborhoods. Using the mpls_nbhd dataset as the base file, join the mpls_suspicious and MplsDemo datasets to it by neighborhood (careful, they are named different things in the different files). Call this new dataset mpls_all.

mpls_nbhd <- st_read("Minneapolis_Neighborhoods/Minneapolis_Neighborhoods.shp", quiet = TRUE)
mpls_all <- mpls_nbhd %>%
left_join(MplsDemo, by = c("BDNAME" = "neighborhood")) %>%
left_join(mpls_suspicious, by = c("BDNAME" = "neighborhood")) %>%
mutate(prop_suspicious = stops/population)
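Because the neighborhood names are spelled differently across the files, a quick mismatch check is worthwhile (a sketch; an empty result means every BDNAME matched a demographics row):

```r
# Neighborhoods in the shapefile with no match in MplsDemo.
mpls_nbhd %>%
  st_drop_geometry() %>%
  anti_join(MplsDemo, by = c("BDNAME" = "neighborhood")) %>%
  select(BDNAME)
```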
Use leaflet to create a map from the mpls_all data that colors the neighborhoods by prop_suspicious. Display the neighborhood name as you scroll over it. Describe what you observe in the map.

suspiciousColor <- colorNumeric("plasma", domain = mpls_all$prop_suspicious)
leaflet(data = mpls_all) %>%
addTiles() %>%
addPolygons(fillColor = ~suspiciousColor(prop_suspicious),
fillOpacity = .6,
label = ~BDNAME,
highlight = highlightOptions(weight = 5,
fillOpacity = 0.9,
bringToFront = FALSE)) %>%
addLegend(pal = suspiciousColor,
values = ~prop_suspicious,
title = str_wrap("Proportion of stops for suspicious behavior"),
position = "bottomright")
Unsurprisingly, the dense urban neighborhoods have the highest ratio of stops to population (particularly Downtown West), due to many factors. I did a little investigating using this code:
MplsStops %>%
filter(neighborhood == "Downtown West") %>%
group_by(race) %>%
summarize(count = n())
Again, unsurprisingly, stops of Black people in Downtown West were much more frequent than would be expected given the area's relatively small Black population: about twice as many Black people are stopped there as white people, even though Black residents make up only about 25% of the population in the area.
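To make that comparison explicit, the raw counts can be turned into shares (a sketch; count() is shorthand for the group_by()/summarize() used above):

```r
MplsStops %>%
  filter(neighborhood == "Downtown West") %>%
  count(race) %>%
  mutate(share = n / sum(n))
```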
Use leaflet to create a map of your own choosing. Come up with a question you want to try to answer and use the map to help answer that question. Describe what your map shows.

My question: what is the ratio of the proportion of Black people among those stopped to the Black proportion of the population in each neighborhood?
mpls_sus_race <- MplsStops %>%
filter(problem == "suspicious"& race == "Black") %>%
group_by(race, neighborhood) %>%
summarize(stops_black = n())
mpls_sus_all <- MplsStops %>%
filter(problem == "suspicious"& race != "Unknown" & is.na(race)==FALSE) %>%
group_by(neighborhood) %>%
summarize(stops = n()) %>%
arrange(stops)
mpls_stop_racial <- mpls_nbhd %>%
left_join(MplsDemo, by = c("BDNAME" = "neighborhood")) %>%
left_join(mpls_sus_race, by = c("BDNAME" = "neighborhood")) %>%
left_join(mpls_sus_all, by = c("BDNAME" = "neighborhood")) %>%
select(c(BDNAME, stops_black, black , stops)) %>%
group_by(BDNAME) %>%
mutate(rate_stops_black = (stops_black/stops)/black)
stopsColor <- colorNumeric("plasma", domain = mpls_stop_racial$rate_stops_black)
leaflet(data = mpls_stop_racial) %>%
addTiles() %>%
addPolygons(fillColor = ~stopsColor(rate_stops_black),
label = ~paste0(BDNAME, ": ", round(rate_stops_black, 2)),
fillOpacity = .8,
highlight = highlightOptions(weight = 5,
fillOpacity = .9)) %>%
addLegend(pal = stopsColor,
values = ~rate_stops_black,
title = "Rate of stopping Black people")
The map itself doesn't end up being super clear, because the scale of the outliers masks the underlying trend; on the other hand, that extreme scale catches attention in its own way. The value I calculated should roughly correspond to how much more likely a Black person is to be stopped for "suspicious" activity in that neighborhood than a resident chosen at random, since it is the share of such stops involving Black people divided by the Black share of the area's population. The scale maxes out around 70, which is staggering; it catches the viewer's eye before they realize such an extreme value could just be a small-sample anomaly. Since I couldn't get the scale to vary logarithmically within the map, mousing over each neighborhood at least shows that the rates are almost all greater than 1, and usually by a substantial margin. It's worth noting that I excluded the "Unknown" and NA race values in the stop data, assuming they reflect the average distribution. While we don't have enough data to trust the outlying neighborhoods, the rest of the data reinforces the same trend at a more reliable level elsewhere. Some neighborhoods were also missing either the Black share of residents or the count of stops of Black people.
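A log scale is one way to tame those outliers; here is a sketch (it assumes the rates are strictly positive, so zero or missing rates would need to be handled first):

```r
# Coloring on log10 of the rate keeps outliers from washing out the midrange.
logColor <- colorNumeric("plasma",
                         domain = log10(mpls_stop_racial$rate_stops_black))

leaflet(data = mpls_stop_racial) %>%
  addTiles() %>%
  addPolygons(fillColor = ~logColor(log10(rate_stops_black)),
              fillOpacity = .8,
              label = ~BDNAME) %>%
  addLegend(pal = logColor,
            values = ~log10(rate_stops_black),
            title = "log10(stop rate ratio)")
```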
DID YOU REMEMBER TO UNCOMMENT THE OPTIONS AT THE TOP?